Main contributions about TALL
CLiC-it 2023 The 9th Italian Conference on Computational Linguistics - Venice
November 30 - December 1, 2023
DSSR 2024 4th International Conference on Data Science and Social Research Naples
March 25-27, 2024
Ai-CODED, Interdisciplinary Workshops Jean Monnet Centre of Excellence
University of Naples L'Orientale, Naples, April 17, 2024
How can AI translate? International Conference of Linguistics - University of Naples Federico II Naples
April 20-23, 2024
SIS 2024 The 52nd Scientific Meeting of the Italian Statistical Society Bari
June 17-20, 2024
U-CHASS Research Group
Unit of Computational Humanities And
Social Sciences - University of Granada
Granada, February 13, 2024
JADT 2024
17th International Conference on the
Statistical Analysis of Textual Data
Bruxelles, June 25-27, 2024
Summer School in Science Mapping 2024
University of Naples Federico II
Naples, May 27-31, 2024
Our main contributions
about Text Mining
M. Aria, C. Cuccurullo, L. D'Aniello, M. Misuraca, M. Spano (2024). Comparative science mapping: a novel conceptual structure analysis with
metadata, Scientometrics. https://doi.org/10.1007/s11192-024-05161-6
L. D'Aniello, N. Robinson-Garcia, M. Aria, C. Cuccurullo (2024). Decoding Knowledge Claims: The Evaluation of Scientific Publication
Contributions through Semantic Analysis. arXiv preprint arXiv:2407.18646. Presented at the 28th International Conference on Science,
Technology and Innovation Indicators (STI 2024), Berlin, 18-20 September 2024.
L. D'Aniello, M. Aria, C. Cuccurullo, M. Misuraca, M. Spano (2024). Extracting knowledge from scientific literature with an integrated Text
Summarization approach. In A. Dister, D. Longrée (eds.), Mots comptés, textes déchiffrés (JADT 2024), Presses Universitaires de Louvain, Vol. 1,
pp. 239-248. ISBN:978-2-39061-471-5.
M. Aria, C. Cuccurullo, L. D’Aniello, M. Misuraca, M. Spano (2022). Text Summarization of a scientific document: a comparison of extractive
unsupervised methods. In Proceedings of the 16th International Conference on Statistical Analysis of Textual Data (Vol. 1, pp. 67-73), ISBN
979-12-80153-30-2, VADISTAT Press/Edizioni Erranti.
L. D’Aniello, M. Spano, C. Cuccurullo, M. Aria (2022). Academic Health Centers’ configurations, scientific productivity, and impact: insights from the
Italian setting. HEALTH POLICY, pp. 1-7, ISSN: 0168-8510, doi: 10.1016/j.healthpol.2022.09.007
M. Aria, C. Cuccurullo, L. D'Aniello, M. Misuraca, M. Spano (2022). Thematic Analysis as a new culturomic tool: the social media coverage on
COVID-19 pandemic in Italy, Sustainability, vol. 14(6), 3643
M. Misuraca, M. Spano (2020). Unsupervised analytic strategies to explore large document collections, in D.F. Iezzi, D. Mayaffre, M. Misuraca (eds.),
Text Analytics. Advances and Challenges, Springer, Heidelberg, pp. 17-28
M. Misuraca, G. Scepi, M. Spano (2021). Using Opinion Mining as an educational analytic: an integrated strategy for the analysis of students'
feedback, Studies in Educational Evaluation, vol. 68 (100979), pp. 1-9
Free and Easy-to-Use Tools timeline
1982: Lebart L. and Morineau A.: SPAD: Système portable d'analyse de données. Technical report, CISIA.
1990: Reinert M.: Alceste, une méthodologie d'analyse des données textuelles et une application: Aurélia de Gérard de Nerval. Bulletin of Sociological Methodology, 26, 24-54.
2009: Ratinaud P.: IRAMUTEQ: Interface de R pour les analyses multidimensionnelles de textes et de questionnaires. https://www.iramuteq.org
2016: Sinclair S. and Rockwell G.: Voyant Tools. Web. http://voyant-tools.org
2017: Higuchi K.: New Quantitative Text Analytical Method and KH Coder Software. Japanese Sociological Review, 68, 334-350.
TALL-App Key features
User-friendly interface: Designed for ease of use, requiring no programming skills, making it accessible to all users
Versatile and general-purpose: Supports a wide range of text analysis tasks
Powerful and scalable: Built on R, leveraging a powerful and scalable platform for statistical computing
Open source: Fully open source, promoting transparency, collaboration, and customization
A comprehensive workflow
LET'S PLAY WITH TALL!
How to install TALL
1. Download and install the most recent version of R (https://cran.r-project.org/)
2. Open R or RStudio and install the latest version of TALL from GitHub by typing the following code in the console window:
install.packages("remotes")
remotes::install_github("massimoaria/tall")
3. Once installed, load the library and then type tall() to open the TALL Shiny app:
library("tall")
tall()
Forget about R - It’s time to explore with TALL!
From now on, you won’t need to write a single line of code!
Interface overview
The rest of the TALL menu will be displayed after data are imported or loaded
TALL menu Donate
Contributors
Credit
TALL buttons
SETTINGS: A list of options for setting analysis parameters
PLAY: Start the process and execute the analyses
EXPORT: Download the graphs or export data in Excel
ADD TO REPORT: Save analysis results to a report
RESTORE: Revert to the original data, restoring the initial object
DOWNLOAD: Export analyzed data in TALL as a .tall file format
Vector Space Model and Bag Of Words
The most widespread way of representing a collection of texts in a structured form is the so-called vector space model (Salton et al., 1975).
Let's consider a collection 𝔻 consisting of q documents (d1, …, dj, …, dq) for which we have a corresponding vocabulary 𝕋 of p terms (t1, …, ti, …, tp).
Each document dj is represented by the vector of its term frequencies: dj = (n1j, …, nij, …, npj).
Each document is viewed as a vector in the vector space ℝ^p of the vocabulary terms, so regardless of the terms actually used, it will always have dimension p.
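The bag-of-words construction above can be sketched in a few lines; the toy documents and resulting vocabulary are invented for illustration.

```python
# Build a toy term-document matrix (bag of words), as in the vector space model.
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

tokenized = [d.split() for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})  # the p terms of the vocabulary

# F[i][j] = frequency n_ij of term t_i in document d_j (a p x q matrix)
counts = [Counter(doc) for doc in tokenized]
F = [[c[t] for c in counts] for t in vocabulary]

# Every document vector has dimension p, regardless of the terms it actually uses
print(len(vocabulary), len(F[0]))
```

Note that most entries of F are zero: document vectors are sparse precisely because each one lives in the full p-dimensional space of the vocabulary.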
Texts pre-processing steps
To make texts analyzable from a quantitative viewpoint, it is necessary to prepare them through a
pre-processing pipeline
RAW TEXT (+ META-DATA) → TOKENIZATION → CLEANSING → NORMALISATION → LEMMATIZATION → LEXICALIZATION → SEMANTIC COMPRESSION → FILTERING → PRE-PROCESSED TEXT (+ META-DATA)
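A minimal sketch of part of this pipeline (tokenization, normalisation, filtering); the stop-word list and the rules are illustrative assumptions, not TALL's implementation.

```python
# Toy pre-processing: tokenize, normalise, filter. All rules are illustrative.
import re

raw = "The Actors paid tribute, again & AGAIN, to Ossie Davis!"

# Tokenization: split the raw text into unit tokens
tokens = re.findall(r"[A-Za-z']+", raw)

# Cleansing + normalisation: lowercase (punctuation was dropped by the tokenizer)
tokens = [t.lower() for t in tokens]

# Filtering: remove a (toy) stop-word list
stopwords = {"the", "to", "again"}
filtered = [t for t in tokens if t not in stopwords]

print(filtered)  # → ['actors', 'paid', 'tribute', 'ossie', 'davis']
```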
Import text from multiple file formats
Sample collections
BBC NEWS A collection of 386 short news stories published in the entertainment
section of the BBC News website.
The texts are in English.
BIBLIOMETRIX A collection of 444 scientific articles written in English in which the authors
used the Bibliometrix R package to perform systematic literature reviews.
The textual data are article abstracts, while the additional information
includes metadata such as the list of co-authors, first author, year of
publication, and journal name.
The abstracts have already been tokenized and POS tagged.
TWEETS
US AIRLINES
A collection of 14,640 tweets from U.S. airline travelers in
February 2015, capturing their expressed feelings on Twitter.
In the upper-right corner of each slide, an icon representing one of the sample collections used in TALL will be displayed
Sample collections raw data example
BBC NEWS Stars pay tribute to actor Davis.
Hollywood stars including Spike Lee, Burt Reynolds and Oscar nominee
Alan Alda have paid tribute to actor Ossie Davis at a funeral in New York.
[...]
BIBLIOMETRIX This study aims to investigate the intersection between organizational
learning (OL) and knowledge management (KM) in the tourism
and hospitality sector by conducting a bibliometric analysis. [...]
TWEETS
US AIRLINES
@VirginAmerica - amazing customer service, again!
💕💕
RaeAnn in SF -
she's the best! #customerservice #virginamerica #flying
Load TALL structured files
Save your progress and continue later at any time
Once the file is imported, the TALL Data Overview displays the following information:
Number of documents
Last modified date
Language of text
List of words manually added to the documents
The last analysis stage completed before export
A .tall file (Tall_Export_File_2024-11-06.tall)
can be re-imported into TALL.
This information helps TALL redirect
you to the appropriate page!
Import Wikipedia pages
Dataset visualization
Display the entire text of the document
Remove the document from the dataset
If a document has been removed,
you can restore the original
dataset by clicking this button!
Export data in Excel
BBC NEWS
IMPORT NEW DATA
Edit, divide, and add external
information
From the Edit menu, the following actions can be performed:
Split texts into different parts at single or user-specified separator characters.
Select a sample of documents randomly extracted from the imported data.
Add additional information to bind to the list of documents.
Universal Dependencies for Linguistic Modeling
A language model must be downloaded before starting the annotation process, which includes
tokenization, PoS tagging, and lemmatization
TALL integrates pre-trained language models from
Universal Dependencies (UD)
UD is a collaborative open-source project providing
treebank annotations that are linguistically consistent
across more than 100 languages.
UD's linguistic models offer reliable tokenization, ensuring precise breakdown of text into analyzable units
https://universaldependencies.org/
Automatic Lemmatization and
PoS-Tagging through Language Models
More than 100
languages!
BBC NEWS
Once the language model is applied to the text(s), the Annotated Text Table is created. Its length corresponds to the total number of tokens in the text. Each token is assigned its lemma and Part of Speech.
Special entities tagging TWEETS
US AIRLINES
TALL identifies special entities across documents, providing a frequency table that displays the count of each entity across all documents. By clicking on View, the frequency distribution of the selected entity is shown.
Special entities tagging TWEETS
US AIRLINES
Additionally, the annotated text table is updated to assign the identified special entities to the corresponding tokens.
Semantic Tagging
Automatic Multi-word creation BBC NEWS
Multiword creation uses the RAKE algorithm to identify candidate keywords (sequences of tokens or lemmas) in the text.
Remember to select which multiwords to include in the dataset and then click on the Apply List button!
Semantic Tagging
Multi-word creation by a list and Custom Term List
The multiword list allows for identifying sequences of individual terms through the terms included in the list (assigned PoS: MULTIWORD).
Terms in the Custom Term List are searched for and identified across lemmas, assigning them a personalized, customized PoS
A list of terms can be imported to identify matches within the tokens or lemmas.
Building the lexical table
By juxtaposing the q document vectors, we obtain a p × q matrix F, also known as the lexical table, on which statistical analyses can be performed.
It is possible to incorporate additional information on documents by considering an aggregated lexical table.
TERM-DOCUMENT MATRIX F, with generic element wij, where:
wi. represents the overall importance of term i in the collection
w.j represents the overall importance of the content of document j
w.. represents the overall importance of the content of the collection
Filtered and Aggregated lexical table
If additional information (metadata) can be recorded for each document, it is possible to obtain aggregated lexical tables.
M is a fully disjunctive coded matrix that has the K modalities of a qualitative variable represented in its columns.
The aggregated table is obtained as G = F × M, where F is p × q and M is q × K, so that G is p × K.
Each column of matrix G represents a sub-collection k (k = 1, ..., K): the interpretation follows the same approach as that used for the marginal row distribution of the F matrix.
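The aggregation G = F × M can be checked on a toy example with invented frequencies and a two-group qualitative variable.

```python
# Aggregated lexical table: G = F · M, with M the fully disjunctive coding
# of a qualitative document variable. All values below are invented.

# F: p = 3 terms x q = 4 documents
F = [[2, 0, 1, 3],
     [0, 1, 1, 0],
     [1, 1, 0, 2]]

# Each of the q = 4 documents belongs to one of K = 2 groups
groups = [0, 0, 1, 1]
K = 2
M = [[1 if groups[j] == k else 0 for k in range(K)] for j in range(4)]  # q x K

# G = F M is p x K: term frequencies summed within each sub-collection
G = [[sum(F[i][j] * M[j][k] for j in range(4)) for k in range(K)]
     for i in range(3)]
print(G)  # → [[2, 4], [1, 1], [2, 2]]
```

Each column of G is exactly the sum of the document vectors belonging to that group, which is why it can be read like a marginal distribution of F.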
Filtering and grouping
BIBLIOMETRIX
PoS Tag Selection
BBC NEWS
Alongside the list of PoS assigned to terms in the data by the pretrained language model, PoS assigned to
Special Entities, Multiwords, or Custom Terms (e.g., HOLIDAY as a PoS for terms like Christmas) can also be selected.
Clicking Play will display the rest of the sidebar menu!
This step is crucial, as the selected terms will be the ones considered in all subsequent analyses!
OVERVIEW
Descriptive statistics
Main information BBC NEWS
Documents Sentences Terms Lexical Indexes
Descriptive statistics
Main information - explore
Clicking on the box will open a new window describing its content!
BBC NEWS
Descriptive statistics
Main information - Tables BBC NEWS
Vocabulary distribution Terms distribution by TF-IDF
Both distributions can be measured on either tokens or lemmas
Descriptive statistics
Wordcloud BBC NEWS
The Wordcloud parameters allow you to choose terms (tokens or lemmas) and adjust the size and number of terms
displayed. Additionally, in any analysis where a term can be selected, clicking on it opens a new window showing its
location, including the documents and sentences where it appears
WORDS
Most Used Words
BBC NEWS
Frequency distribution of
terms tagged with the
specified PoS
Most Used Words
BBC NEWS
Multiple methods for Topic Detection
Concepts are the glue that holds our mental world together
(Murphy, 2002)
Within a text, through the diverse words used by the author, a series of concepts are
expressed
One of the main goals of quantitative text analyses is to extract interesting information
from a textual body, highlighting the key concepts/themes (Misuraca and Spano, 2020)
Non-probabilistic approaches:
factorial-based: correspondence analysis, latent semantic analysis
network-based: community detection
Probabilistic approaches:
topic modelling: probabilistic LSA, latent Dirichlet allocation
Correspondence analysis
Correspondence Analysis (Benzécri, 1973) is a technique that decomposes a two-way or multi-way
contingency table into a series of factors, each representing a latent aspect of the association among the
observed data.
When applied to text data, it is typically called Lexical Correspondence Analysis (Lebart, 1993).
LCA identifies and visualizes hidden linguistic structures, revealing key themes and similarities between
documents based on shared vocabulary.
[Relative-frequency term × document matrix F, with generic element fij and marginal distributions fi. (terms) and f.j (documents)]
Tandem analysis: LCA + Hierarchical cluster analysis
LCA and Hierarchical Cluster Analysis can be applied in sequence using the approach referred to as tandem analysis (Arabie and Hubert, 1994).
STEP 0: Texts pre-processing and construction of the lexical table
STEP 1: Lexical Correspondence Analysis
STEP 2: Cluster Analysis on LCA factors
Correspondence analysis
Factorial plane BBC NEWS
Correspondence analysis
Factorial plane BBC NEWS
In the Options menu, several parameters can be set, including whether to perform the CA on documents, sentences, groups, or paragraphs; the choice between lemmas or tokens; and the number of clusters and dimensions to extract.
Various graphical parameters can be customized, including plane features and dot size.
Correspondence analysis
Dendrogram BBC NEWS
Correspondence analysis
Tables BBC NEWS
Singular Values  Coordinates  Contributions  Squared Cosines
Clustering
Hierarchical Clustering BBC NEWS
Clustering
Hierarchical Clustering - Parameters BBC NEWS
Perform the analysis on lemmas or tokens, specifying which similarity indexes to use and the number of clusters and words to consider.
Correspondence analysis and
Clustering features BBC NEWS
From lexical to co-occurrence matrices
It is possible to recover the context of term usage within documents in a collection by constructing a co-occurrence matrix from a lexical table: At = Fbin × FbinT, where Fbin is the 0/1 (presence/absence) version of F.
If term importance is represented by presence/absence, the element aii' indicates the number of documents in which terms i and i' co-occur.
To achieve greater granularity, the matrix At can be constructed as a terms × sentences matrix by segmenting documents using strong separators (e.g., . ! ?)
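A toy computation of At = Fbin × FbinT (frequencies invented for illustration):

```python
# Term x term co-occurrence matrix from a binarised lexical table.

F = [[2, 0, 1],   # term 1 frequencies in documents 1..3
     [1, 1, 0],   # term 2
     [0, 3, 1]]   # term 3

# Fbin: 0/1 presence/absence version of F
Fbin = [[1 if n > 0 else 0 for n in row] for row in F]

p, q = len(Fbin), len(Fbin[0])
# At[i][i2] = number of documents in which terms i and i2 co-occur
At = [[sum(Fbin[i][j] * Fbin[i2][j] for j in range(q)) for i2 in range(p)]
      for i in range(p)]
print(At)  # → [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
```

The diagonal aii gives the document frequency of term i, and the matrix is symmetric, which is what makes it usable later as the adjacency matrix of an undirected graph.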
From lexical to co-occurrence matrices (2)
Similarly, a document × document matrix Ad = FbinT × Fbin can be constructed, where each element represents the number of terms shared between two documents.
Both At and Ad can also be created based on similarity or distance criteria to express relationships between terms or documents.
Main similarity measures in Text Mining include the Jaccard coefficient, cosine similarity, and association strength.
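The first two measures can be computed directly on binary profiles; the vectors below are illustrative.

```python
# Jaccard coefficient and cosine similarity between two binary term profiles.
from math import sqrt

a = [1, 0, 1, 1]   # presence/absence profile of item A
b = [1, 1, 1, 0]   # profile of item B

inter = sum(x & y for x, y in zip(a, b))
union = sum(x | y for x, y in zip(a, b))
jaccard = inter / union                       # |A ∩ B| / |A ∪ B|

dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(round(jaccard, 3), round(cosine, 3))  # → 0.5 0.667
```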
Network analysis
In network analysis, the term × term matrix At is an adjacency matrix and can be represented as an undirected weighted graph G = (N, L, V), where:
N is the set of nodes
L is the set of edges between nodes
V is the set of weights
The graph-based approach enables the use of various network analysis tools, both descriptive (e.g.,
betweenness, density) and analytical (e.g., community detection, network dynamics)
Community detection
A concept or topic can be viewed as a set of strongly linked terms within the network.
In network analysis, a group of nodes that are densely interconnected and sparsely connected to other parts of the network is defined as a community (Wasserman & Faust, 1994).
Each subgraph represents a concept/topic within the analyzed document collection.
Network Co-word Analysis
BBC NEWS
Network Co-word Analysis
Parameters BBC NEWS
Network analysis can be performed using lemmas or tokens, with term co-occurrence calculated across sentences, documents, groups, or paragraphs. Graphical parameters allow adjustments to nodes and edges.
To improve network visualization:
Label transparency reflects the centrality of the corresponding node
To avoid label overlap within node groups, the distance among nodes is measured and only labels for nodes with the highest centrality value are displayed
Don't worry: hovering over a node with the cursor reveals its label!
Network Co-word Analysis TWEETS
US AIRLINES
This is a network where nodes are EMOJI and MENTIONS, special entities selected from the PoS Tagging Selection
DOCUMENTS
Topic Modeling
Topic Modeling can be viewed as a specific factorization of the lexical matrix F (a probabilistic decomposition): F ≈ Φ × Θ
F (p × q): term distributions per document, P(t|d)
Φ (p × K): term distributions per topic, P(t|z)
Θ (K × q): topic distributions per document, P(z|d)
Each term-document probability is thus reconstructed through the K latent topics: P(t|d) = Σz P(t|z) P(z|d)
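The decomposition can be verified numerically on invented toy distributions (p = 2 terms, K = 2 topics, one document):

```python
# Numeric check of P(t|d) = Σ_z P(t|z) P(z|d) with toy distributions.

Phi = [[0.9, 0.2],    # P(t|z): rows are terms, columns are topics
       [0.1, 0.8]]
Theta = [[0.5],       # P(z|d): rows are topics, columns are documents
         [0.5]]

# F = Phi · Theta gives the term distribution per document
F = [[sum(Phi[i][k] * Theta[k][j] for k in range(2)) for j in range(1)]
     for i in range(2)]
print(F)  # ≈ [[0.55], [0.45]]
```

Each column of the reconstructed F sums to 1, as a term distribution per document must.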
Latent Dirichlet Allocation (LDA)
Blei et al. in 2003 proposed Latent Dirichlet Allocation (LDA), a generative probabilistic model designed to uncover hidden topics in a large collection of documents.
In LDA, we infer topic distributions across a document collection, enabling topic prediction for new documents based on their terms.
In the plate notation of the model:
θj: proportion of topics for a document dj
φk: proportion of terms in topic k
zi: assignment of a topic k to a term ti belonging to dj
α: hyperparameter for the a priori document-topic distribution
β: hyperparameter for the a priori topic-term distribution
Low α: fewer topics per document, more generalized. High α: more topics per document.
Low β: fewer terms per topic, more generalized. High β: more defined topics.
P(d, t) = P(d) P(θ|α) Σz P(t|z, φ) P(z|θ) P(φ|β)
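LDA's generative story can be illustrated with the standard library alone; the vocabulary, K = 2 topics, and hyperparameter values below are toy assumptions, and the Dirichlet draws are built from Gamma variates.

```python
# Sketch of LDA's generative process: draw topic mixtures and topic-word
# distributions from Dirichlet priors, then sample a toy document.
import random

random.seed(42)
K, vocab = 2, ["goal", "match", "film", "actor"]

def dirichlet(alpha, n):
    # A symmetric Dirichlet draw: normalised Gamma(alpha, 1) variates
    g = [random.gammavariate(alpha, 1) for _ in range(n)]
    s = sum(g)
    return [x / s for x in g]

phi = [dirichlet(0.5, len(vocab)) for _ in range(K)]  # P(t|z), one row per topic
theta = dirichlet(0.5, K)                             # P(z|d) for one document

doc = []
for _ in range(8):
    z = random.choices(range(K), weights=theta)[0]          # pick a topic z_i
    doc.append(random.choices(vocab, weights=phi[z])[0])    # pick a term t_i from it
print(doc)
```

With a low alpha (0.5 here), theta tends to concentrate on few topics, which is exactly the "fewer topics per document" behaviour described above.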
Topic modeling K choice BBC NEWS
Setting parameters for finding the optimal number of topics (K): once the level of analysis is specified, a metric for model tuning can be selected, followed by the terms and number of topics to consider.
BBC NEWS
Topic modeling Model Estimation
Topic by Words
Click here to view additional topics on the next page
BBC NEWS
Topic modeling Model Estimation
Parameters
If selected, the number of topics is determined based on the optimal number identified through the CaoJuan method.
BBC NEWS
Topic modeling Model Estimation
Topic by Docs
Clicking on a specific document displays its full text.
BBC NEWS
Topic modeling Model Estimation
Tables
Beta probability Theta probability
BBC NEWS
Topic modeling Model Estimation
Topic Correlations
Polarity Detection
Polarity detection aims to analyze opinions expressed as free texts written in natural language. It can
detect different emotional states and reveal a variety of behavioral and attitudinal patterns.
The term polarity is commonly used in Linguistics to distinguish affirmative and negative terms
(Giannakidou, 2012; Löbner, 2000). The overall evaluation of text polarity gives the sentiment
orientation of the text itself.
Each term can have an assigned polarity score or an intensity level of polarity: negative terms fall in [-∞; 0[ (or [-n; 0[ on a bounded scale), neutral terms score 0, and positive terms fall in ]0; +∞] (or ]0; +n]).
In addition to polarized terms, valence shifters can be considered to "adjust" the polarity:
Negators: terms like "not" invert polarity (e.g., "not good" becomes negative).
Amplifiers: terms like "very" increase polarity strength (e.g., "very good" becomes stronger).
De-amplifiers: terms like "slightly" decrease polarity strength.
Polarized lexicons in TALL
Lexicons can be used to identify which terms are present in the analyzed text, allowing for polarity
calculation at the sentence or document level
Hu and Liu (2004) - Opinion Lexicon - One of the foundational resources for sentiment analysis in
consumer reviews. It categorizes words into positive and negative sentiment classes, focusing on
terms commonly found in product reviews.
Widely used in analyzing consumer sentiment, especially in e-commerce and review platforms, due
to its comprehensive list of opinion words.
Loughran and McDonald (2016) - Financial Sentiment Dictionary - Created specifically for financial
texts, it addresses the limitations of general-purpose sentiment dictionaries in financial contexts. It
includes categories such as "positive," "negative," "uncertainty," "litigious," and "constraining,"
tailored for finance-related sentiment analysis.
NRC Emotion Lexicon (Mohammad and Turney, 2010) - Developed by the National Research Council
(NRC) of Canada, this lexicon goes beyond simple polarity, capturing a wider range of emotions.
Suitable for detecting nuanced emotional tones in diverse text types, ranging from social media posts
to literature, due to its detailed emotion categories.
Lexicon-based polarity scores computing
The following steps are implemented in TALL for calculating polarity at a document level
Each term in the document is tagged if it appears in the sentiment lexicon (e.g., positive
terms are assigned +1, and negative terms -1)
For each polarized term p:
1. Determine the context (i.e., nearby terms surrounding the target term).
2. Count the number of negators n, amplifiers a, and de-amplifiers d in the term's context.
Polarity at a document level:
1. Aggregate polarity scores across terms in each document.
2. Normalization: Logistic transformation constrains polarity between -1 and +1.
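A minimal sketch of these steps, with an invented four-word lexicon, a two-token context window, and illustrative shifter weights (not TALL's actual parameters); the logistic-style rescaling at the end is one possible way to constrain scores to ]-1, +1[.

```python
# Lexicon-based polarity with valence shifters. Lexicon, window size, and
# shifter weights are all illustrative assumptions.
import math

lexicon = {"good": 1, "great": 1, "bad": -1, "terrible": -1}
negators, amplifiers, deamplifiers = {"not"}, {"very"}, {"slightly"}

def document_polarity(text, window=2):
    tokens = text.lower().split()
    total = 0.0
    for idx, tok in enumerate(tokens):
        if tok not in lexicon:
            continue
        score = lexicon[tok]
        context = tokens[max(0, idx - window):idx]   # nearby preceding terms
        n = sum(t in negators for t in context)
        a = sum(t in amplifiers for t in context)
        d = sum(t in deamplifiers for t in context)
        score *= (-1) ** n                # negators invert polarity
        score *= (1 + 0.5 * a)            # amplifiers strengthen it
        score *= (1 - 0.4 * d)            # de-amplifiers weaken it
        total += score
    # logistic-style transformation keeps the result in ]-1, +1[
    return 2 / (1 + math.exp(-total)) - 1

print(round(document_polarity("the food was not good"), 3))        # → -0.462
print(round(document_polarity("a very good very great film"), 3))  # → 0.905
```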
Polarity detection BBC NEWS
Polarity detection
Parameters BBC NEWS
Polarity can be measured
across documents and
groups using one of the
available lexicons
Polarity detection
Top Words BBC NEWS
The distribution of top positive and
negative polarized words is displayed,
with the colors of the bars representing
the number of documents in which
words are categorized as very negative,
negative, neutral, positive, or very
positive.
Polarity detection
Table BBC NEWS
Summarization
EXTRACTIVE SUMMARIZATION
TALL employs an extractive summarization technique,
which involves selecting the most relevant sentences
from the document and organizing them systematically.
Sentences in the summary are directly sourced from the
original text.
The TextRank algorithm is used to identify these key
sentences by creating a network graph, where nodes
represent sentences and edges connect them based on a
similarity function. The Jaccard index is used to measure
the similarity between sentences. Sentences are then
ranked according to their centrality using the PageRank
algorithm, and the top-ranked sentences are included in
the final summary.
Text summarization aims to facilitate reading and searching for information in large documents by producing concise summaries that include the most relevant sentences while retaining all pertinent details from the text.
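The TextRank pipeline just described can be sketched end-to-end; the sentences, the damping factor of 0.85, and the two-sentence summary length are illustrative assumptions.

```python
# Extractive summarization sketch: sentences as nodes, Jaccard similarity as
# edge weights, PageRank-style scores by power iteration.

sentences = [
    "stars pay tribute to actor ossie davis",
    "hollywood stars paid tribute at a funeral",
    "the film won several awards last year",
    "awards season celebrates the best film",
]
sets = [set(s.split()) for s in sentences]
n = len(sets)

# Jaccard similarity between every pair of sentences
sim = [[len(sets[i] & sets[j]) / len(sets[i] | sets[j]) if i != j else 0.0
        for j in range(n)] for i in range(n)]

# PageRank by power iteration on the weighted similarity graph
d, scores = 0.85, [1.0 / n] * n
for _ in range(50):
    new = []
    for i in range(n):
        rank = sum(scores[j] * sim[j][i] / sum(sim[j])
                   for j in range(n) if sum(sim[j]) > 0 and sim[j][i] > 0)
        new.append((1 - d) / n + d * rank)
    scores = new

# The top-ranked sentences form the extractive summary
summary = sorted(range(n), key=lambda i: -scores[i])[:2]
print([sentences[i] for i in summary])
```

Adjusting the slice length plays the role of TALL's conciseness slider: a longer slice includes more sentences from the original document.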
Summarization BBC NEWS
Summarize the text of a selected document. Sentences listed are ranked by their relevance
Summarization BBC NEWS
Content of the selected document
The summarization analysis provides a summary
of varying conciseness. Adjusting the slider from
More Concise to Less Concise will include more
sentences from the document's text.
Summarization
Full document and Table BBC NEWS
All sentences of the document are displayed,
with the most relevant ones highlighted
Document sentences ranked by their relevance
Future Directions and Key Enhancements
Integration with
ChatGPT API
Enhanced Representation and
Weighting Techniques
Dynamic and Temporal
Analysis
Enhanced Polarity Detection
and Emotional Analysis
Text Classification for
Document Categorization
Automated
Web Scraping
MASSIMO ARIA massimo.aria@unina.it
UNIVERSITÀ DEGLI STUDI DI NAPOLI FEDERICO II
CORRADO CUCCURULLO corrado.cuccurullo@unicampania.it
UNIVERSITÀ DEGLI STUDI DELLA CAMPANIA “LUIGI VANVITELLI”
LUCA D’ANIELLO luca.daniello@unina.it
UNIVERSITÀ DEGLI STUDI DI NAPOLI FEDERICO II
MICHELANGELO MISURACA michelangelo.misuraca@unical.it
UNIVERSITÀ DELLA CALABRIA
MARIA SPANO maria.spano@unina.it
UNIVERSITÀ DEGLI STUDI DI NAPOLI FEDERICO II www.tall-app.com
Thank You
for your attention
References
Arun R., Suresh V., Veni Madhavan C.E. and Narasimha Murthy M.N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In
Mohammed J. Zaki, Jeffrey Xu Yu, B. Ravindran, and Vikram Pudi, editors, Advances in Knowledge Discovery and Data Mining, pages 391-402, Berlin, Heidelberg. Springer
Berlin Heidelberg.
Benzécri J.P. (1982). Histoire et préhistoire de l’analyse des données. Dunod, Paris.
Blei D.M., Ng A.Y. and Jordan M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993-1022.
Bolasco S. Morrone A. and Baiocchi F. (1999). A Paradigmatic Path for Statistical Content Analysis Using an Integrated Package of Textual Data Treatment, in M. Vichi, O.
Opitz (eds.), Classification and Data Analysis. Theory and Application, Springer-Verlag, Heidelberg, 237-246.
Callon M., Courtial J.-P., Turner W.A. and Bauin S. (1983). From translations to problematic networks: An introduction to co-word analysis. Social science information,
22(2):191-235.
Cao J., Xia T., Li J., Zhang Y. and Tang S. (2009). A density-based method for adaptive lda model selection. Advances in Machine Learning and Computational Intelligence.
Neurocomputing, 72(7):1775-1781.
Demsar J., Curk T., Erjavec A., Gorup C., Hocevar T., Milutinovic M., Mozina M., Polajnar M., Toplak M., Staric A., Stajdohar M., Umek L., Zagar L., Zbontar J., Zitnik M. and
Zupan B. (2013). Orange: Data Mining Toolbox in Python, Journal of Machine Learning Research 14(Aug): 2349-2353.
Deveaud R., Sanjuan E. and Bellot P. (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique, 17:61-84, 06.
Devlin J., Chang M.W., Lee K. and Toutanova K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Fortunato S. and Hric D. (2016). Community detection in networks: A user guide. Physics Reports, 659:1-44, nov.
Giannakidou, A. (2012). Negative and positive polarity items. In K. von Heusinger, C. Maienborn, & P. Portner (Eds.), Semantics: An international handbook of natural
language meaning (pp. 1660-1712). Berlin: De Gruyter Mouton. Vol. 2 of Handbooks of Linguistics and Communication Science.
Griffiths T.L. and Steyvers M. (2004). Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228-5235.
Hinton G.E. (1986). Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society (1), 12.
Hu M. and Liu B. (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD '04, 168-177, New York, NY, USA. Association for Computing Machinery.
References
Jain A.K., Murty M.N. and Flynn P.J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3):264-323.
Lebart L. and Morineau A. (1982). SPAD: Système Portable pour l’Analyse des Données.
Lebart L., Morineau A. and Bécue Bertaut M. (1989). Spad.T: Système portable pour l’analyse des données textuelles.
Lebart L., Salem A. and Berry L. (1997). Exploring textual data, volume 4. Springer Science & Business Media.
Löbner, S. (2000). Polarity in natural language: Predication, quantification and negation in particular and characterizing sentences. Linguistics and Philosophy, 23, 213-308.
Loughran T. and McDonald B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4):1187-1230.
Mihalcea R. and Tarau P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404-411,
Barcelona, Spain, July. Association for Computational Linguistics.
Mikolov T., Le Q.V. and Sutskever I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Misuraca M. and Spano M. (2020). Unsupervised Analytic Strategies to Explore Large Document Collections. Heidelberg: SPRINGER, 06, 17-28.
Mohammad S. and Turney P. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT
2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, 26-34, Los Angeles, CA, June. Association for Computational Linguistics.
Page L., Brin S., Motwani R. and Winograd T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies
Project.
Ratinaud P. (2009). IRAMUTEQ: Interface de R pour les analyses multidimensionnelles de textes et de questionnaires [computer software]. Available at: http://www.iramuteq.org
Reinert M. (1983). Une méthode de classification descendante hiérarchique: application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2): 187-198.
http://www.numdam.org/item/CAD_1983__8_2_187_0/
Rose S., Engel D., Cramer N. and Cowley W. (2010). Automatic keyword extraction from individual documents, Wiley Online Library, 1-20
Salton G. and Buckley C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management,24(5):513-523.
Sinclair, S. and Rockwell G. (2016). Voyant Tools. Web. http://voyant-tools.org
Xu D. and Tian Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science,2:165-193.